Magnolia: A novel DHT Architecture for Keyword-based Searching

Authors

  • Ashish Gupta
  • Manan Sanghi
  • Peter Dinda
  • Fabian Bustamante
Abstract

DHT-based P2P systems such as Chord, Pastry, Tapestry, and Kademlia greatly improve on unstructured P2P systems such as Gnutella and Kazaa by providing (1) scalable and efficient O(log n) lookup and routing for any document and (2) good load-balancing properties for very large numbers of keys or documents. However, to look up a document, its complete identifier must be known in order to compute its unique hashed key and route to the correct node, which is a major disadvantage compared to unstructured systems. Our goal in this ongoing project is to create a DHT-based P2P architecture that supports efficient partial keyword searches in a scalable manner.

Some recent proposals for keyword search [2], [4], [1], [5] have suggested storing all document pointers for a keyword on the node corresponding to keyID = h(keyword). For example, pointers to all files that have "usenix" in their title are stored on the single node corresponding to h("usenix"). Multiple-keyword search then becomes possible by computing the hash of each keyword and visiting the corresponding nodes to fetch all results (which can be processed in the network for boolean operations before being returned). Though correct, we argue that this approach does not align well with the goals of a DHT system for very large-scale and transient networks. The high heterogeneity of keywords in both occurrence frequency and query frequency (both have been shown to follow a Zipf distribution) further aggravates the problem: (1) Pointers to millions of documents corresponding to a common keyword can end up on a single node, so the overall distribution of document pointers over the nodes can be heavily skewed. (2) When a node disappears, all document pointers for the keyword(s) stored on that node are removed from the network, hampering future searches. This is especially problematic if the nodes storing pointers for popular keywords fail. (3) Nodes can be swamped with search traffic for these popular keywords, creating routing hotspots (resulting from routing a large number of messages to a single destination) as well as query hotspots.

We have designed a simple DHT architecture, Magnolia, which is not affected by the aforementioned problems while still providing log n hops for routing and lookup and a low, bounded number of nodes visited and amount of traffic generated. Our model scenario is a large-scale P2P file-sharing system with over 1 million highly transient nodes that is responsible for storing over 1 billion documents. Our architecture proposes novel node-grouping and key-distribution methods based on a multi-hashing scheme and makes use of hash function properties to effectively distribute the pointers corresponding to every keyword over a tunable number of nodes. Using multi-hashing, each keyword is balanced across a set of nodes in the system, with little overlap between different sets of nodes; this achieves good load balance in terms of both traffic and key storage and makes search highly robust to failures. We want to form these groups such that popular keywords have a low probability of being assigned to the same group. We also propose a modified DHT routing architecture that can store documents and look up keyword queries in log n hops, even though the keyword pointers are mapped to multiple nodes. The amount of traffic generated and the number of nodes visited are also low and bounded.

Figure 1 shows the technique of multi-hashing. We have k hash functions h1(), ..., hk(), where hi() maps a keyword to an m′-bit key (m′ < m, where m is the total number of bits used in a nodeID or documentID). For each keyword corresponding to every document instance (which we assume currently are derived from its title or meta information like wi
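
The key-generation step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the use of SHA-1, the salting of each hi() with its index i, the truncation to the top m′ bits, and the names `single_hash`, `multi_hash`, and `M_BITS` are all assumptions made for illustration. How Magnolia turns these keys into node groups is not covered in the text above, so the sketch stops at producing the keys.

```python
import hashlib

M_BITS = 160  # assumed identifier width m (the SHA-1 output size)


def single_hash(keyword: str) -> int:
    """Baseline scheme from prior proposals: keyID = h(keyword), one m-bit key."""
    digest = hashlib.sha1(keyword.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def multi_hash(keyword: str, k: int, m_prime: int) -> list[int]:
    """Return k keys of m_prime bits each, one per hash function h_1..h_k.

    Each h_i is modeled (as an assumption) by salting SHA-1 with the index i
    and keeping the top m_prime bits of the digest.
    """
    if not 0 < m_prime < M_BITS:
        raise ValueError("m_prime must satisfy 0 < m_prime < m")
    keys = []
    for i in range(1, k + 1):
        salted = f"{i}:{keyword}".encode("utf-8")
        value = int.from_bytes(hashlib.sha1(salted).digest(), "big")
        keys.append(value >> (M_BITS - m_prime))  # truncate to m_prime bits
    return keys


if __name__ == "__main__":
    # One keyword yields a single key under the baseline and k keys under multi-hashing.
    print(hex(single_hash("usenix")))
    print(multi_hash("usenix", k=4, m_prime=16))
```

With a single hash, every pointer for "usenix" is tied to one key and therefore one node; with k salted hashes the same keyword yields k independent m′-bit keys, which is the property the abstract relies on to spread a keyword's pointers over a tunable number of nodes.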

Related resources

DHT Based Searching Improved by Sliding Window

Efficient full-text searching is a big challenge in Peer-to-Peer (P2P) systems. Recently, the Distributed Hash Table (DHT) has become one of the reliable communication schemes for P2P. Some research efforts perform keyword searching and result intersection on a DHT substrate. Two or more search requests must be issued for a multi-keyword query. This article proposes a Sliding Window improved Multi-keyword ...

E-Chord: Keyword-Based Search Algorithm Based on DHT in Mediation Architecture

E-Chord is a Distributed Hash Table (DHT) algorithm inspired by the Chord, Pastry and CAN algorithms. It is meant to provide keyword-based search in the Three-Layer Mediation Architecture. E-Chord is deployed in the middle layer (Integration Layer) of the architecture and provides services to the higher layer (Presence Layer) and the lower layer (Homogenization Layer).

KEYNOTE: Keyword Search by Node Selection for Text Retrieval on DHT-Based P2P Networks

Efficient full-text keyword search remains a challenging problem in P2P systems. Most traditional keyword search systems on DHT overlay networks perform the join operation of keywords at the document level, which incurs a huge storage and bandwidth cost. In this paper, we present KEYNOTE, a novel keyword search system that performs the join operation at the node level. Compared to th...

Data Indexing and Querying in DHT Peer-to-Peer Networks

Peer-to-peer DHT systems, such as Chord [8], CAN [5], Pastry [6], or Tapestry [11], make it simple to discover specific data when their complete identifiers—or keys—are known in advance. In practice, however, users looking up resources stored in peer-to-peer systems often have only partial information for identifying these resources and tend to submit broad queries. In this paper, we describe t...

Enabling Semantic Search in Structured P2p Networks via Distributed Databases and Web Services

The first generation of overlay P2P networks had scalability problems, although it offered useful keyword-based searching functionality. The new generation of structured networks, based on Distributed Hash Tables (DHT), solved the scalability problem but at the same time eliminated the possibility of performing searches by proximity. This can be considered as an effective limitation as users of P2P net...

Publication date: 2005